Stochastic Gradient Descent: Going As Fast As Possible But Not Faster
Authors
Abstract
When applied to training deep neural networks, stochastic gradient descent (SGD) often incurs steady progression phases, interrupted by catastrophic episodes in which the loss and gradient norm explode. A possible mitigation of such events is to slow down the learning process. This paper presents a novel approach to controlling the SGD learning rate that relies on two statistical tests. The first, aimed at fast learning, compares the momentum of the normalized gradient vectors to that of random unit vectors and gracefully increases or decreases the learning rate accordingly. The second is a change-point detection test, aimed at detecting catastrophic learning episodes; upon its triggering, the learning rate is instantly halved. The combined ability to speed up and slow down the learning rate allows the proposed approach, called SALERA, to learn as fast as possible but not faster. Experiments on standard benchmarks show that SALERA performs well in practice and compares favorably to the state of the art.

Machine Learning (ML) algorithms require efficient optimization techniques, whether to solve convex problems (e.g., for SVMs) or non-convex ones (e.g., for neural networks). In the convex setting, the main focus is on the order of the convergence rate [Nesterov, 1983, Defazio et al., 2014]. In the non-convex case, ML is still more of an experimental science. Significant efforts are devoted to devising optimization algorithms (and robust default values for the associated hyper-parameters) tailored to the typical regime of ML models and problem instances (e.g., deep convolutional neural networks for MNIST [Le Cun et al., 1998] or ImageNet [Deng et al., 2009]) [Duchi et al., 2010, Zeiler, 2012, Schaul et al., 2013, Kingma and Ba, 2014, Tieleman and Hinton, 2012]. As the data size and the model dimensionality increase, mainstream convex optimization methods are adversely affected.
Hessian-based approaches, which optimally handle convex optimization problems however ill-conditioned they are, do not scale up, and approximations are required [Martens et al., 2012]. Overall, stochastic gradient descent (SGD) is increasingly adopted in both convex and non-convex settings, with good performance and linear tractability [Bottou and Bousquet, 2008, Hardt et al., 2015].

Within the SGD framework, one of the main issues is how to control the learning rate: the objective is to reach a satisfactory learning speed without triggering any catastrophic event, manifested by a sudden rocketing of the training loss and the gradient norm. Finding "how much is not too much" in terms of learning rate is a slippery game; it depends both on the current state of the system (the weight vector) and the current mini-batch. Often, the eventual convergence of SGD is ensured by decaying the learning rate as O(1/t) [Robbins and Monro, 1951, Defazio et al., 2014] or O(1/√t) [Zinkevich, 2003] with the number t of mini-batches. While learning rate decay effectively prevents catastrophic events, it is a main cause of the days or weeks of computation behind the many breakthroughs of deep learning. Many and diverse approaches have thus been designed to achieve learning rate adaptation [Amari, 1998, Duchi et al., 2010, Schaul et al., 2013, Kingma and Ba, 2014, Tieleman and Hinton, 2012, Andrychowicz et al., 2016] (more in Section 1).

This paper proposes a novel approach to adaptive SGD, called SALERA (Safe Agnostic LEarning Rate Adaptation). SALERA is based on the conjecture that, if learning catastrophes are well taken care of, the learning process can speed up whenever successive gradient directions show general agreement about the direction to go.
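The classical decay schedules mentioned above can be written down directly. The following is a minimal sketch (not the authors' code; function names are illustrative) of the O(1/t) and O(1/√t) schedules and a plain SGD update:

```python
# Minimal sketch (illustrative, not the paper's code) of classical
# learning-rate decay schedules for SGD, with t counting mini-batches:
# eta_t = eta_0 / t (Robbins-Monro style) and eta_t = eta_0 / sqrt(t).
import math

def lr_inverse(eta0, t):
    """O(1/t) decay of the initial learning rate eta0."""
    return eta0 / t

def lr_inverse_sqrt(eta0, t):
    """O(1/sqrt(t)) decay of the initial learning rate eta0."""
    return eta0 / math.sqrt(t)

def sgd_step(w, grad, eta):
    """One SGD update on the weight vector w (a list of floats)."""
    return [wi - eta * gi for wi, gi in zip(w, grad)]
```

Such schedules are safe but conservative: the learning rate shrinks regardless of whether the current dynamics would actually tolerate a larger step, which is precisely the slowness the paper aims to avoid.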
The frequent occurrence of catastrophic episodes, long observed by neural net practitioners [Goodfellow et al., 2016, Chapter 8], raises the question of how to best mitigate their impact. The answer depends on whether these events can be anticipated with some precision. Framing catastrophic episodes as random events, we adopt a purely curative strategy (as opposed to a preventive one): detecting and instantly curing catastrophic episodes. Formally, a sequential cumulative-sum change detection test, the Page-Hinkley (PH) test [Page, 1954, Hinkley, 1970], is adapted and used to monitor the learning curve reporting the mini-batch losses. If a change in the learning curve is detected, the system undergoes an instant cure: the learning rate is halved and the system backtracks to its former state. Such an instant cure can be thought of as a dichotomic approximation of line search (see e.g. Defazio et al. [2014], Eq. (3)). Once the risk of catastrophic episodes is well addressed, the learning rate can be adapted in a more agile manner: the ALERA (Agnostic LEarning Rate Adaptation) process increases (resp. decreases) the learning rate whenever the correlation among successive gradient directions is higher (resp. lower) than random, by comparing the actual gradient momentum with an agnostic momentum built from random unit vectors.

The contribution of the paper is twofold. First, it proposes an original and efficient way to control learning dynamics (Section 2.1). Second, it opens a new approach for handling catastrophic events and salvaging a significant part of doomed-to-fail runs (Section 2.2). The experimental validation compares favorably with the state of the art on the MNIST and CIFAR-10 benchmarks (Section 3).
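The curative mechanism above can be sketched as follows. This is a minimal illustration of the standard Page-Hinkley test monitoring a loss sequence, coupled with the learning-rate halving; the threshold values, class name, and the omission of backtracking are assumptions for the sketch, not the authors' implementation:

```python
# Illustrative sketch (not the authors' implementation) of the Page-Hinkley
# test for detecting an upward change in the mini-batch loss. The statistic
# m_t accumulates deviations of the loss from its running mean (minus a
# tolerance delta); an alarm fires when m_t exceeds its running minimum M_t
# by more than a threshold lambda.
class PageHinkley:
    def __init__(self, delta=0.005, lam=50.0):
        self.delta = delta      # tolerated magnitude of fluctuations
        self.lam = lam          # detection threshold (lambda)
        self.n = 0              # number of observations seen
        self.mean = 0.0         # running mean of the monitored signal
        self.cum = 0.0          # cumulative deviation m_t
        self.cum_min = 0.0      # running minimum M_t of m_t

    def update(self, x):
        """Feed one loss value; return True if a change is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.lam

# Usage: halve the learning rate upon detection (backtracking to the
# previous weight vector, also part of the cure, is omitted here).
ph = PageHinkley(delta=0.0, lam=1.0)
eta = 0.1
for loss in [0.5, 0.4, 0.4, 0.3, 2.0, 3.0]:   # sudden loss explosion
    if ph.update(loss):
        eta *= 0.5          # instant cure: halve the learning rate
```

On this toy sequence the test stays silent during the steady-progression phase and fires once the loss rockets, which is exactly the asymmetry the curative strategy relies on.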
Similar References
Conjugate gradient neural network in prediction of clay behavior and parameters sensitivities
The use of artificial neural networks has increased in many areas of engineering. In particular, this method has been applied to many geotechnical engineering problems and demonstrated some degree of success. A review of the literature reveals that it has been used successfully in modeling soil behavior, site characterization, earth retaining structures, settlement of structures, slope stabilit...
LSH-Sampling Breaks the Computational Chicken-and-Egg Loop in Adaptive Stochastic Gradient Estimation
Stochastic Gradient Descent, or SGD, is the most popular optimization algorithm for large-scale problems. SGD estimates the gradient by uniform sampling with sample size one. Several other works suggest faster epoch-wise convergence by using weighted non-uniform sampling for better gradient estimates. Unfortunately, the per-iteration cost of maintaining this adaptive distribu...
Conjugate Directions for Stochastic Gradient Descent
The method of conjugate gradients provides a very effective way to optimize large, deterministic systems by gradient descent. In its standard form, however, it is not amenable to stochastic approximation of the gradient. Here we explore ideas from conjugate gradient in the stochastic (online) setting, using fast Hessian-gradient products to set up low-dimensional Krylov subspaces within individ...
Towards Stochastic Conjugate Gradient Methods
The method of conjugate gradients provides a very effective way to optimize large, deterministic systems by gradient descent. In its standard form, however, it is not amenable to stochastic approximation of the gradient. Here we explore a number of ways to adopt ideas from conjugate gradient in the stochastic setting, using fast Hessian-vector products to obtain curvature information cheaply. I...
Combining Conjugate Direction Methods with Stochastic Approximation of Gradients
The method of conjugate directions provides a very effective way to optimize large, deterministic systems by gradient descent. In its standard form, however, it is not amenable to stochastic approximation of the gradient. Here we explore ideas from conjugate gradient in the stochastic (online) setting, using fast Hessian-gradient products to set up low-dimensional Krylov subspaces within indivi...
Accelerating Stochastic Gradient Descent
There is widespread sentiment that fast gradient methods (e.g. Nesterov’s acceleration, conjugate gradient, heavy ball) are not effective for the purposes of stochastic optimization due to their instability and error accumulation. Numerous works have attempted to quantify these instabilities in the face of either statistical or non-statistical errors (Paige, 1971; Proakis, 1974; Polyak, 1987; G...
Journal: CoRR
Volume: abs/1709.01427
Pages: -
Publication date: 2017